Abstract. We identify the classical Perceptron algorithm with margin
as a member of a broader family of large margin classifiers which we
collectively call the Margitron. The Margitron, despite sharing the
same update rule with the Perceptron, is shown in an incremental setting 
to converge in a finite number of updates to solutions possessing any 
desirable fraction of the maximum margin. Experiments comparing the 
Margitron with decomposition SVMs on tasks involving linear kernels 
and 2-norm soft margin are also reported. 



1 Introduction 

It is widely accepted that the larger the margin of the solution hyperplane the
greater is the generalisation ability of the learning machine [18, 14]. The simplest
online learning algorithm for binary linear classification, the Perceptron [12, 11],
does not aim at any margin. The problem, instead, of finding the optimal margin
hyperplane lies at the core of Support Vector Machines (SVMs) [18, 1]. Their
efficient implementation, however, is somewhat hindered by the fact that they 
require solving a quadratic programming problem. 

The complications encountered in implementing SVMs have respurred
interest in alternative large margin classifiers many of which are based on the
Perceptron algorithm. The oldest such algorithm which appeared long before 
the advent of SVMs is the standard Perceptron with margin [2], a straightfor- 
ward extension of the Perceptron, which, however, in an incremental setting is 
known to be able to guarantee achieving only up to 1/2 of the maximum margin 
that the dataset possesses [8, 10, 15]. Subsequently, various algorithms succeeded 
in achieving larger fractions of the maximum margin by employing modified 
perceptron-like update rules. Such algorithms include ROMMA [9], ALMA [3], 
CRAMMA [16] and MICRA [17]. A somewhat different approach from the hard 
margin one adopted by most of the algorithms above was also developed which 
focuses on the minimisation of the 1-norm soft margin loss through stochastic 
gradient descent. There is a connection, however, between such algorithms and 
the Perceptron since their unregularised form with constant learning rate is iden- 
tical to the Perceptron with margin. Notable representatives of this approach are 
the pioneer NORMA [7] and the very recent Pegasos [13]. 

A question that arises naturally and which we attempt to answer in the
present work is whether it is possible to achieve a guaranteed fraction of the
maximum margin larger than 1/2 while retaining the original perceptron update
rule. To this end we construct a whole new family of algorithms at least one 
member of which has guaranteed convergence in a finite number of steps to a 
solution hyperplane possessing any desirable fraction of the unknown maximum 
margin. This family of algorithms in which the classical Perceptron with margin 
is naturally embedded will be termed the Margitron. Hopefully, the algorithms
belonging to the margitron family by virtue of being generalisations of the very 
successful Perceptron will have a respectable performance in various classification 
tasks. 

Section 2 contains some preliminaries and the description of the Margitron 
algorithm. Section 3 is devoted to a theoretical analysis. Section 4 contains our 
experimental results while Section 5 our conclusions. 

2 The Margitron Algorithm 

In what follows we assume that we are given a training set which either is linearly 
separable from the beginning or becomes separable by an appropriate feature 
mapping into a space of a higher dimension [18,1]. This higher dimensional 
feature space in which the patterns are linearly separable will be the considered 
space. By placing all patterns in the same position at a distance $\rho$ in an additional
dimension we construct an embedding of our data into the so-called augmented
space [2]. The advantage of this embedding is that the linear hypothesis in the
augmented space becomes homogeneous. Throughout our discussion a reflection
with respect to the origin in the augmented space of the negatively labelled
patterns is assumed in order to allow for a uniform treatment of both categories
of patterns. Also, $R = \max_k \|y_k\|$, with $y_k$ the $k$-th augmented pattern. Obviously,
$R \ge \rho$.
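
The augmentation and reflection just described can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' code: the function names `augment_and_reflect` and `max_norm` and the toy data are assumptions made here, and the labels are taken to be +1 or -1.

```python
import math


def augment_and_reflect(patterns, labels, rho):
    """Append the constant coordinate rho to each pattern, then multiply the
    pattern by its label (+1 or -1), i.e. reflect the negative class."""
    augmented = []
    for x, label in zip(patterns, labels):
        y = list(x) + [rho]
        augmented.append([label * c for c in y])
    return augmented


def max_norm(patterns):
    """R: the largest Euclidean norm among the augmented patterns."""
    return max(math.sqrt(sum(c * c for c in y)) for y in patterns)


# Hypothetical toy data: two patterns in the plane, one per class.
X = [[2.0, 1.0], [-1.0, -2.0]]
labels = [1, -1]
Y = augment_and_reflect(X, labels, rho=1.0)
R = max_norm(Y)
```

Every augmented pattern carries the coordinate $\pm\rho$, so its norm is at least $\rho$; this is why $R$ cannot be smaller than $\rho$.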

The relation characterising optimally correct classification of the training
patterns $y_k$ by a weight vector $u$ of unit norm in the augmented space is

$$u \cdot y_k \ge \gamma_d \equiv \max_{u' : \|u'\| = 1} \min_i \{u' \cdot y_i\} \quad \forall k . \tag{1}$$

We shall refer to $\gamma_d$ as the maximum directional margin. It coincides with the
maximum margin in the augmented space with respect to hyperplanes passing
through the origin if no reflection is assumed. The directional margin $\gamma_d$ and
the maximum geometric margin $\gamma$ in the original (non-augmented) feature space
satisfy the inequality

$$1 \le \gamma / \gamma_d \le R / \rho .$$

As $\rho \to \infty$, $R/\rho \to 1$ and from the above inequality $\gamma_d \to \gamma$ [15].

In the Margitron algorithm the augmented weight vector $a_t$ is initially set to
zero, i.e. $a_0 = 0$, and is updated according to the classical perceptron rule

$$a_{t+1} = a_t + y_k \tag{2}$$

each time a misclassification condition is satisfied by a training pattern $y_k$. For
the misclassification condition we consider two options. The first is to replace
the constant functional margin threshold $b > 0$ in the misclassification condition
of the classical Perceptron with margin by a term proportional to a power of the
number of steps (updates) $t$

$$a_t \cdot y_k \le b\, t^{1-\epsilon} , \quad \epsilon > 0 . \tag{3}$$

As a second option we employ a margin threshold proportional to a power of the
length of the augmented weight vector leading to a misclassification condition

$$a_t \cdot y_k \le b\, \|a_t\|^{1-\epsilon} , \quad \epsilon > 0 . \tag{4}$$

For $t = 0$ in both (3) and (4) the threshold is set to 0 resulting in the first
pattern being always misclassified. The Margitron with misclassification condition
given by (3) will be referred to as the t-margitron whereas the version with
condition given by (4) as the $\ell$-margitron. Setting $\epsilon = 1$ in both the t- and the
$\ell$-margitron we recover the Perceptron with margin. Notice that the introduction
of a constant learning rate is pointless since it amounts to a rescaling of $b$.
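
To spell out why a learning rate changes nothing, one can rescale the weight vector. The lines below are a sketch only: $\eta > 0$ is a hypothetical constant learning rate (a symbol not used in the original text), and (3) and (4) are read as $a_t \cdot y_k \le b\, t^{1-\epsilon}$ and $a_t \cdot y_k \le b\, \|a_t\|^{1-\epsilon}$.

```latex
% Update with learning rate \eta:  \tilde a_{t+1} = \tilde a_t + \eta\, y_k ,  \tilde a_0 = 0 .
% Define a_t = \tilde a_t / \eta , so that a_{t+1} = a_t + y_k as in (2).
\begin{align*}
\tilde a_t \cdot y_k \le b\, t^{1-\epsilon}
  &\iff a_t \cdot y_k \le (b/\eta)\, t^{1-\epsilon} , \\
\tilde a_t \cdot y_k \le b\, \|\tilde a_t\|^{1-\epsilon}
  &\iff a_t \cdot y_k \le (b\, \eta^{-\epsilon})\, \|a_t\|^{1-\epsilon} .
\end{align*}
```

In both cases the run with learning rate $\eta$ makes exactly the same sequence of updates as a unit-rate run with a rescaled $b$, so the solution direction is unchanged.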



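t-margitron

Input: A linearly separable augmented set $S = (y_1, \dots, y_k, \dots, y_m)$ with reflection assumed
Fix: $\epsilon$, $b$
Define: $\bar{\epsilon} = 1 - \epsilon$
Initialise: $t = 0$, $a_0 = 0$, $b_0 = 0$
repeat
    for $k = 1$ to $m$ do
        $p_{tk} = a_t \cdot y_k$
        if $p_{tk} \le b_t$ then
            $a_{t+1} = a_t + y_k$
            $t \leftarrow t + 1$
            $b_t = b\, t^{\bar{\epsilon}}$
        end if
    end for
until no update made within the for loop


$\ell$-margitron

Input: A linearly separable augmented set $S = (y_1, \dots, y_k, \dots, y_m)$ with reflection assumed
Fix: $\epsilon$, $b$
Define: $q_k = \|y_k\|^2$, $\bar{\epsilon} = \frac{1}{2}(1 - \epsilon)$
Initialise: $t = 0$, $a_0 = 0$, $\ell_0 = 0$, $b_0 = 0$
repeat
    for $k = 1$ to $m$ do
        $p_{tk} = a_t \cdot y_k$
        if $p_{tk} \le b_t$ then
            $a_{t+1} = a_t + y_k$
            $\ell_{t+1} = \ell_t + 2 p_{tk} + q_k$
            $t \leftarrow t + 1$
            $b_t = b\, \ell_t^{\bar{\epsilon}}$
        end if
    end for
until no update made within the for loop


Fig. 1. The algorithms t-margitron and $\ell$-margitron.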
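
As a rough illustration of the two procedures of Fig. 1, the following Python sketch implements both variants in one function. It is not the authors' implementation: the name `margitron`, the `variant` switch and the `max_epochs` safety cap are assumptions introduced here, and the misclassification test is read as "inner product not larger than the current threshold", with the threshold equal to 0 before the first update.

```python
def margitron(patterns, eps, b, variant="t", max_epochs=10_000):
    """Run the t- or l-margitron on augmented, reflected patterns.

    Returns (weight vector, number of updates). max_epochs is a safety cap
    added for this sketch; it is not part of the algorithm.
    """
    if variant not in ("t", "l"):
        raise ValueError("variant must be 't' or 'l'")
    a = [0.0] * len(patterns[0])   # a_0 = 0
    t = 0                          # number of updates so far
    sq_len = 0.0                   # squared length of a, tracked incrementally
    threshold = 0.0                # b_0 = 0: the first pattern always updates
    sq_norms = [sum(c * c for c in y) for y in patterns]   # q_k
    for _ in range(max_epochs):
        updated = False
        for y, q in zip(patterns, sq_norms):
            p = sum(ai * yi for ai, yi in zip(a, y))
            if p <= threshold:
                a = [ai + yi for ai, yi in zip(a, y)]       # perceptron update
                sq_len += 2.0 * p + q
                t += 1
                if variant == "t":
                    threshold = b * t ** (1.0 - eps)
                else:
                    threshold = b * sq_len ** ((1.0 - eps) / 2.0)
                updated = True
        if not updated:
            return a, t
    raise RuntimeError("no convergence within max_epochs")


# Hypothetical toy set, already augmented and reflected.
P = [[2.0, 1.0, 1.0], [1.0, 2.0, -1.0], [3.0, 1.0, 1.0], [1.0, 3.0, -1.0]]
a, t = margitron(P, eps=1.0, b=5.0, variant="t")
```

With `eps=1.0` both variants use the constant threshold `b`, so they perform the same updates, which mirrors the statement that the Perceptron with margin is recovered in that case.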



Both (3) and (4) can be written for $t > 0$ in the form

$$u_t \cdot y_k \le C(t) \tag{5}$$

($u_t = a_t / \|a_t\|$, $C(t) > 0$) involving the margin $u_t \cdot y_k$ in the augmented space of
the pattern $y_k$ with respect to the zero-threshold hyperplane normal to $a_t$ (i.e.
the directional margin of $y_k$) instead of its functional margin $a_t \cdot y_k$. The function
$C(t)$ is given by $C(t) = b\, t^{1-\epsilon} \|a_t\|^{-1}$ for the t-margitron and by $C(t) = b\, \|a_t\|^{-\epsilon}$
for the $\ell$-margitron. We expect that $\epsilon < 1$ will result in an enhancement of the
margin threshold $C(t)$ relative to the case $\epsilon = 1$ (Perceptron with margin) and
that this enhancement will eventually lead to a slower average fall-off of $C(t)$
as $t$ progresses rather than to a genuine increase, a fall-off being necessary for the
algorithm to converge. This expectation is further supported by the fact that, as
we demonstrate below, $C(t) \le c\, t^{-\epsilon}$ with $c > 0$. Hopefully, such a slower decrease
of the margin required by the misclassification condition will ensure convergence
to solutions possessing margins which are larger fractions of $\gamma_d$.

Taking the inner product of (2) with the optimal direction $u$ we obtain

$$a_{t+1} \cdot u - a_t \cdot u = y_k \cdot u \ge \gamma_d ,$$

a repeated application of which gives [11]

$$\|a_t\| \ge a_t \cdot u \ge \gamma_d t . \tag{6}$$

Using (6) we get $C(t) \le c\, t^{-\epsilon}$ with $c = b \gamma_d^{-1}$ and $c = b \gamma_d^{-\epsilon}$ for the t- and the
$\ell$-margitron, respectively.
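
The step from (6) to the bound on $C(t)$ is short but worth writing out. The lines below are a worked sketch, assuming the forms $C(t) = b\, t^{1-\epsilon} \|a_t\|^{-1}$ (t-margitron) and $C(t) = b\, \|a_t\|^{-\epsilon}$ ($\ell$-margitron) and the inequality $\|a_t\| \ge \gamma_d t$.

```latex
\begin{align*}
\text{t-margitron:}\quad
  C(t) &= \frac{b\, t^{1-\epsilon}}{\|a_t\|}
        \le \frac{b\, t^{1-\epsilon}}{\gamma_d\, t}
        = b\, \gamma_d^{-1}\, t^{-\epsilon} , \\
\text{$\ell$-margitron:}\quad
  C(t) &= b\, \|a_t\|^{-\epsilon}
        \le b\, (\gamma_d\, t)^{-\epsilon}
        = b\, \gamma_d^{-\epsilon}\, t^{-\epsilon} .
\end{align*}
```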

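3 Theoretical Analysis

Lemma 1. Let

$$g(t) = t^{\epsilon} - \alpha\, t^{\epsilon - 1} - \beta$$

with $t \in [1, +\infty)$, $\epsilon > 0$, $\alpha \ge 1$ and $\beta > 0$. Then, there is a single value $t_b$ of $t$
satisfying

$$g(t_b) = 0$$

which is bounded as follows

$$\alpha + \beta^{1/\epsilon} < t_b < \tfrac{1}{\epsilon}\alpha + \beta^{1/\epsilon} \quad \text{if } \epsilon < 1 ,$$

$$\tfrac{1}{\epsilon}\alpha + \beta^{1/\epsilon} < t_b < \alpha + \beta^{1/\epsilon} \quad \text{if } \epsilon > 1 .$$

Proof. The function $g(t)$ with $g(1) < 0$ is unbounded from above and is either
strictly increasing (if $\epsilon \le 1$) or has at most one local minimum (if $\epsilon > 1$).
Therefore, there is a single root $t_b$ of $g(t)$. In addition, for $g(t) \ne 0$,
$\mathrm{sign}(t - t_b) = \mathrm{sign}(g(t))$. Let $0 < \epsilon < 1$. We have

$$g(\alpha + \beta^{1/\epsilon}) = \beta^{1/\epsilon} (\alpha + \beta^{1/\epsilon})^{\epsilon - 1} - \beta < \beta^{1/\epsilon} \beta^{(\epsilon - 1)/\epsilon} - \beta = 0 ,$$

implying that $t_b > \alpha + \beta^{1/\epsilon}$. Moreover,

$$g(\tfrac{1}{\epsilon}\alpha + \beta^{1/\epsilon}) = \beta \left(1 + (\tfrac{1}{\epsilon} - 1)\alpha\beta^{-1/\epsilon}\right) \left(1 + \tfrac{1}{\epsilon}\alpha\beta^{-1/\epsilon}\right)^{\epsilon - 1} - \beta > \beta \left(1 + \tfrac{1}{\epsilon}\alpha\beta^{-1/\epsilon}\right)^{1 - \epsilon} \left(1 + \tfrac{1}{\epsilon}\alpha\beta^{-1/\epsilon}\right)^{\epsilon - 1} - \beta = 0 ,$$

implying that $t_b < \tfrac{1}{\epsilon}\alpha + \beta^{1/\epsilon}$. (Here we make use of $1 + qz > (1 + z)^q$
for $-1 < z \ne 0$ and $0 < q < 1$.) If $\epsilon > 1$, instead, both the above
inequalities are reversed. (Here we make use of $1 - qz < (1 + z)^{-q}$ for $z, q > 0$.)
Finally, for $\epsilon = 1$ obviously $t_b = \alpha + \beta$. □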
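
Lemma 1 is easy to probe numerically. The sketch below assumes the reading $g(t) = t^{\epsilon} - \alpha t^{\epsilon-1} - \beta$ with bounds $\alpha + \beta^{1/\epsilon}$ and $\alpha/\epsilon + \beta^{1/\epsilon}$ on the root; the helper names `g`, `root_of_g` and `lemma1_bounds` are hypothetical, and bisection is simply a convenient choice made here for locating the root.

```python
def g(t, eps, alpha, beta):
    """g(t) = t^eps - alpha * t^(eps - 1) - beta, as read from Lemma 1."""
    return t ** eps - alpha * t ** (eps - 1.0) - beta


def root_of_g(eps, alpha, beta, tol=1e-12):
    """Locate the single root t_b of g on [1, +inf) by bisection."""
    lo, hi = 1.0, 2.0
    while g(hi, eps, alpha, beta) <= 0.0:   # g is unbounded from above
        hi *= 2.0
    # Invariant: g(lo) <= 0 < g(hi); the sign of g tells on which side t_b lies.
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if g(mid, eps, alpha, beta) <= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lemma1_bounds(eps, alpha, beta):
    """Return (lower, upper) bounds on t_b; they coincide when eps == 1."""
    first = alpha + beta ** (1.0 / eps)
    second = alpha / eps + beta ** (1.0 / eps)
    return min(first, second), max(first, second)


lower, upper = lemma1_bounds(0.5, 2.0, 3.0)
t_b = root_of_g(0.5, 2.0, 3.0)
```

For the values used above (chosen arbitrarily for illustration) the root can also be found by hand: with $s = \sqrt{t}$ the equation becomes $s^2 - 3s - 2 = 0$, whose positive solution squared lies between the two bounds.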
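
Theorem 1. The t-margitron with $0 < \epsilon \le 1$ converges in

$$t_c \le \frac{1}{\epsilon} \frac{R^2}{\gamma_d^2} + \left( \frac{2}{2 - \epsilon} \frac{b}{\gamma_d^2} \right)^{1/\epsilon} \tag{7}$$

updates to a solution hyperplane possessing directional margin $\gamma'_d$ which is a
fraction $f$ of the maximum directional margin $\gamma_d$ obeying the inequality

$$f = \frac{\gamma'_d}{\gamma_d} \ge \left( \frac{R^2}{b} + \frac{2}{2 - \epsilon} \right)^{-1} . \tag{8}$$

Moreover, an after-running estimate of $f$ is obtainable from

$$f = \frac{\gamma'_d}{\gamma_d} \ge f_{est} \equiv \left( \frac{R^2}{b}\, t_c^{\epsilon - 1} + \frac{2}{2 - \epsilon} \right)^{-1} . \tag{9}$$

Proof. From (2) and taking into account (3) we get

$$\|a_{t+1}\|^2 - \|a_t\|^2 = \|y_k\|^2 + 2 y_k \cdot a_t \le R^2 + 2 b\, t^{1-\epsilon}$$

a repeated application $t$ times of which leads to

$$\|a_t\|^2 \le R^2 t + 2b \sum_{l=1}^{t-1} l^{1-\epsilon} \le R^2 t + 2b \int_0^t l^{1-\epsilon}\, dl = R^2 t + \frac{2}{2 - \epsilon}\, b\, t^{2-\epsilon} . \tag{10}$$

Combining (6) with (10) we obtain

$$\gamma_d t \le \|a_t\| \le R \sqrt{t + \frac{2}{2 - \epsilon} \frac{b}{R^2}\, t^{2-\epsilon}} \tag{11}$$

from where

$$\frac{\gamma_d^2}{b}\, t^{\epsilon} \le \frac{R^2}{b}\, t^{\epsilon - 1} + \frac{2}{2 - \epsilon} \tag{12}$$

or, equivalently,

$$g(t) = t^{\epsilon} - \frac{R^2}{\gamma_d^2}\, t^{\epsilon - 1} - \frac{2}{2 - \epsilon} \frac{b}{\gamma_d^2} \le 0 . \tag{13}$$

The value $t_b$ of $t$ for which the above relation holds as an equality provides an
upper bound on the number of updates $t_c$ required for convergence. According
to Lemma 1 there is a single such value which is bounded as stated there. This
leads to the looser bound of (7).

Combining (3) with (5) and using (10) we obtain

$$\frac{C(t)}{\gamma_d} = \frac{b\, t^{1-\epsilon}}{\gamma_d \|a_t\|} \ge \left( \frac{\gamma_d^2 R^2}{b^2}\, t^{2\epsilon - 1} + \frac{2}{2 - \epsilon} \frac{\gamma_d^2}{b}\, t^{\epsilon} \right)^{-1/2} . \tag{14}$$

Multiplying both sides of (12) with its r.h.s. we get

$$\frac{\gamma_d^2 R^2}{b^2}\, t^{2\epsilon - 1} + \frac{2}{2 - \epsilon} \frac{\gamma_d^2}{b}\, t^{\epsilon} \le \left( \frac{R^2}{b}\, t^{\epsilon - 1} + \frac{2}{2 - \epsilon} \right)^{2}$$

or

$$\left( \frac{\gamma_d^2 R^2}{b^2}\, t^{2\epsilon - 1} + \frac{2}{2 - \epsilon} \frac{\gamma_d^2}{b}\, t^{\epsilon} \right)^{-1/2} \ge \left( \frac{R^2}{b}\, t^{\epsilon - 1} + \frac{2}{2 - \epsilon} \right)^{-1} .$$

Using this last inequality and taking into account that $f = \gamma'_d / \gamma_d > C(t_c) / \gamma_d$,
(14) leads to (9). Setting $t_c = 1$ in (9) we obtain the weaker bound of (8). □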
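
The two quantities of Theorem 1 are simple to evaluate once a run has finished. The sketch below is hedged: it assumes the readings $t_c \le \epsilon^{-1} R^2/\gamma_d^2 + \left(\tfrac{2}{2-\epsilon}\, b/\gamma_d^2\right)^{1/\epsilon}$ for (7) and $f_{est} = \left((R^2/b)\, t_c^{\epsilon-1} + \tfrac{2}{2-\epsilon}\right)^{-1}$ for (9), which follow from Lemma 1 and the proof above but are reconstructions rather than verbatim formulas; the function names are hypothetical.

```python
def t_margitron_update_bound(R, gamma_d, b, eps):
    """Upper bound on the number of updates, reading of (7), 0 < eps <= 1."""
    alpha = (R / gamma_d) ** 2
    beta = 2.0 * b / ((2.0 - eps) * gamma_d ** 2)
    return alpha / eps + beta ** (1.0 / eps)


def t_margitron_f_est(R, b, eps, t_c):
    """After-running lower bound on the achieved fraction, reading of (9)."""
    return 1.0 / ((R * R / b) * t_c ** (eps - 1.0) + 2.0 / (2.0 - eps))


# Hypothetical numbers, chosen only to exercise the formulas.
bound = t_margitron_update_bound(R=2.0, gamma_d=1.0, b=3.0, eps=1.0)
f_one_update = t_margitron_f_est(R=1.0, b=1.0, eps=0.5, t_c=1)
f_many_updates = t_margitron_f_est(R=1.0, b=1.0, eps=0.5, t_c=100)
```

For $\epsilon < 1$ the estimate grows with the number of updates actually made and stays below $1 - \epsilon/2$, so a long run certifies a larger fraction than the before-running bound obtained with $t_c = 1$.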



Remark 1. Noticing that the number of updates $t_c$ required for convergence of
the t-margitron satisfies (12) we get

$$\gamma_d \le R \sqrt{t_c^{-1} + \frac{2}{2 - \epsilon} \frac{b}{R^2}\, t_c^{-\epsilon}}$$

from where an alternative after-running lower bound on $\gamma'_d / \gamma_d$ is obtainable.
This bound, however, does not have to be smaller than $1 - \epsilon/2$.

Remark 2. The r.h.s. of (14) has in the interval [1, +00) a single extremum, which 

is a maximum, at U = — 2e|(2 — e)(2e)^^^^ ^ sign(l — 2e). Therefore, it 
is legitimate in calculating a lower bound on C(tc)/7d using (14) to replace tc 
with th provided tc > f*. This leads to the stronger than the one of (8) bound 

which, however, is $\gamma_d$-dependent. The condition $t_c \ge t_*$ is automatically satisfied
for $1/2 \le \epsilon \le 1$. For $0 < \epsilon < 1/2$, instead, we may ensure that $t_c \ge t_*$ if
the r.h.s. of (14) is larger than or equal to 1 for $t = 1$ and as a consequence
the normalised margin threshold $C(t)$ is initially not lower than the maximum
directional margin $\gamma_d$, i.e. $C(1) \ge \gamma_d$. A condition sufficient for this to be the

case is > (^1 + 2^"^) • ^^^^ event the algorithm is forced to converge 
only after C{t) has fallen bellow 7d which cannot occur as long as t < ti,. li we 
choose 

1 

and replace in (15) th with its lower bound tih = ^^^'p') ^ ~ 2^"^"^"^' 
which is lower than the lower bound inferred from Lemma 1, we can easily 
verify that f > (^S + 2^7) . If < e < | the parameter S should satisfy the 

l-e 1_2 _1 

constraint 5 < (1 - I) ' (if)' (l+2^lt) ' which for < e «: 5 and 

^ <i; 1 suggests a rather slow convergence. Thus, it is not advisable in this case 
to employ values of b for which the constraint on S is satisfied. The algorithm 
will still be able to achieve a large fraction of 7d if it happens to converge in a 
sufficiently large number of updates tc as it can be deduced from (9). 
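
Lemma 2. For $x, y > 0$ and $-1 < \epsilon \le 1$ it holds that

$$x^{1+\epsilon} - y^{1+\epsilon} \le \frac{1 + \epsilon}{2} \frac{x^2 - y^2}{y^{1-\epsilon}} . \tag{17}$$

Proof. For $\epsilon = 1$ (17) holds obviously as an equality. For $-1 < \epsilon < 1$ (17) is
equivalent to $1 \le \frac{1+\epsilon}{2} \alpha^{1-\epsilon} + \frac{1-\epsilon}{2} \alpha^{-(1+\epsilon)}$, with $\alpha = x/y$. The r.h.s. of the above
inequality is minimised for $\alpha = 1$ and takes the value 1. □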

Lemma 3. For t > 1 and < e < 1 it holds that 

< t^lnty-^ - [e] , (18) 
where [e] denotes the integer part of e. 

Proof. For f = lore = l(18) holds obviously as an equality. Let t > 1 and 
< e < 1. Then, with x = (18), as a strict inequality, is equivalent to 

/(e) = e'^a;(lna;)^~'^ ~ x + 1 > 0. For x > e'^ wc have ^ < from where 
/(e) > lim/(e) = 1. For 1 < x < e''\ instead. / has only one local minimum 

e— »1 

at e = In a; with value at that minimum given by h{x) = x^^ Inx — x + 1. 
It can be easily shown that ^ — (1 ~ e)x~^ ^ lnx~* ^ + x~^ ^ — 1 has no 

local minima in the interval (l,e'=). Thus, 4^ > minjlim 4^, lim 4^1 = 0. 

ax — ^1 ax X — ^e*^ ax j 

Therefore, h{x) > lim h{x) = and consequently /(e) > 0. r-i 

X— >1 



Lemma 4. Let 

g{t)=t^-(a, (^]fy^\a2t-^)-P 

with t G [1, +oo), < e < 1, ai,a2,P > and a = ai + a2 > 2 + e. Then, 
g{to) > with 

to = {^a + p-e)(^H7a + /3-e)^ . 

Proof. Let A = a/{a + e/3e ) < 1 and a; = In ^ > C = In (l + |) > 1 such that 
to = f^x^-'' > 1 + f > e, Into =x+{l-e)]iLx>l, P = a^(l - A)7(Ae)^ and 
^2^0^ < Q!2 (IntoAo)^"*- Then, 

gito) >tl-a (Ifl) - /? = - Ae (l + (1 - e)l^)"^ - 
> 1 - Ae(l + (1 - e)e-i)i-^ - (1 - A)' C^'"^^' • 
Here we made use of tg > 1, < e ^ and < C^*^ ^^"^^ This last 

expression is minimised with respect to A for A = , _-|^ — which 

1 + (1 - e)e 

substituted leads to 

5(fo)>(l + (l-e)e-i)-7(e) 
with /(e) = (1 - e)(l - C"') + (1 + (1 - e)e-^Y - (1 + e(l - e)e-i). Employing the 
expansion In^ = 2F^ (^^t) for 2; > [4] we obtain C, > (^^) 

fromwhereC"' < (^)' = {} - < 1- ^e(l-e)- Moreover, (1 + (1- 

e)e-i)^-(l + e(l-e)e-i) > -\e{l-efe-'^ since {l + zY-{l+qz) > \q{q-l)z'^ 
for 2 > and < g < 1. Thus, /(e) > |e(l - ef{l - (1 - e)e-2) > leading to 
9{to) > 0. □ 

Theorem 2. The l-margitron with < e < 1 converges in 

t^<{\A + B-e)(ln{^A + B-e)^ (19) 

updates, with A= (2 + e — 2[e])^ and B = (1 + e)— rq-?, io a solution hyperplane 

possessing directional margin 7^ w/iic/i is a fraction / 0/ the maximum directional 
margin 7,1 obeying the inequality 

Moreover, for < e < 1 an after-running estimate of ^ is obtainable from 

Id 

(21) 

Here the integer N > satisfies any of the constraints 



t^>N>'-±^{^y-\ t,>N{^^^y. (22) 

Obviously, the choice N = 1 is always acceptable. A near optimal choice of N is 
If 

2 iTTy 



A^opt 



1, provided it satisfies one of the above constraints. 
Proof. From (2) and taking into account (4) we get 

||a*+if - \\a,\f = \\y,\f + 2y, ■a,<R^ + 2h \\at\\'-' 
or, assuming t>l, 

\\at+if-\\atf < R\ I , 

1-1 - (- — own . 11-1 - e ~ " ■ 
By using (6) in the r.h.s. of the above inequality and (17) in its l.h.s. we obtain



l+e 1+e - 27I 

a repeated application t — N times {t> N >1) oi which gives 



d l=N 

t-1 



1 

l+e 



<5:^ ^^"'+ / l'-'dl]+bt 

'd V l = N / 

I d 
Id 

Thus, employing the obvious bound ||ajv|| < RN, we are led to 

||a.|| < L.. . + ij? (A) _ Hsim) + (1 + 

(23) 

which, although derived for t > N, turns out to be satisfied even for t = N. 
Combining (6) with (23) we obtain 

ll+H^+^<\\at\\'+'<{L^,f+^ (24) 

from where 

< (^) t-^ (25) 

or, equivalently, 

9N{t) < (26) 

with 

= ^^-{tS^^ - ^ ^ (t^ - iV^ + (e - [e])iV-i) t-i - (1 + e) ^ . 

'(27) 

Let us consider the derivative of giq{t) 
where 
$D_N(t)$ is strictly increasing and therefore has at most one root $t_{m_N}$ ($D_N(t_{m_N}) = 0$)
where obviously $g_N(t)$ acquires a minimum (since $g_N(t)$ is unbounded from
above) with $g_N(t_{m_N}) < 0$ (since $g_N(N) < 0$). Thus, $g_N(t)$ starts from negative
values at $t = N$ and with $t$ increasing either tends monotonically to infinity
or decreases further until it acquires a minimum at $t = t_{m_N}$ and then increases
monotonically towards infinity. In both cases there is a single value of $t$ for
which

5iv(*b^) = (28) 

and moreover for gj^{t) ^ 

sign(i-ib„) = sign(5^r(t)) • (29) 

The unique value of t for which (24), (25) and (26) hold as equalities provides 
an upper bound on the number of updates tc required for convergence. 
Combining (4), (5), (23), (27) and (28) we get 



f-ll> gftc) _ b > b > b _ b 



- R 



(30) 



For e = 1 the above lower bound on / is optimised for A'^ = 1 in which case it 
reduces to (20). For < e < 1 we may replace in the above lower bound on / 
first with 7^ and subsequently, on the condition that one of the constraints 
(22) is satisfied, tbw with tc since both replacements can be shown to loosen 
the bound. Thus, we obtain / > f^st with f^st given by (21). An approximate 
maximisation of /est with respect to A'' leads to the near optimal value A^opt of 
Theorem 2. 

Let us choose = 1 in (27) and replace ( J with thereby lowering 
the value of gi{t) 

g,{t) >t^- + A + 1 - N) I - (1 + ^)^- (31) 

By employing (18) in the r.h.s. of (31) we obtain 
g,{t) > m ^t^- ^(i±^t^(lnt)i- + ^(1 - [e]))^ - i^ + e)^- (32) 

For < e < 1 g{t) becomes a function of the type considered in Lemma 4 with 

a = (2 + e)^ > 2 + e. Obviously to of Lemma 4 satisfies ,91 (to) > .9(^0) > 

/d 

and according to (29) is an upper bound on tbi- Also for e = 1 g{t) becomes 
a function of the type considered in Lemma 1 and tb of Lemma 1 is an upper 
bound on tbi • Actually in this very special case tbi coincides with ty, (since (32) 
holds as an equality) which, in turn, coincides with its upper and lower bound. 
This, given that tc < ^bu completes the proof of (19). 

Alternatively using -i + < 0, 1 - [e] < (1 - [e])i' and (l + e)(H-e- [e]) = 
(1 + e)2-[^] in the r.h.s. of (31) we obtain 

g,{t) > m ^t^-^{l + e)2-W^i-i _ (1 + . 

The function g{t) is of the type considered in Lemma 1 and its only root 
satisfying 

9{k) = (33) 

is an upper bound on the number of updates looser than i-c tbi < ib- 
Moreover, the upper bound on i^, from Lemma 1 is an alternative upper bound 
on tc- Combining (30) for A'' = 1, the inequality < tb and (33) we obtain 



(34) 

Additionally, = ^ (1 + e)^~['l ^ is a lower bound on % since it is lower than 

the lower bound inferred from Lemma 1. Replacing ib with its lower bound tihi 
in the r.h.s. of (34) we get the weaker bound (20). q 

Remark 3. The lower bounds (30) and (34) on the fraction $f$ involving the
unknown maximum margin $\gamma_d$ are of great theoretical importance because they
guarantee before running that the algorithm will achieve a margin which is a
more substantial fraction of $\gamma_d$ than the one inferred from (20). As a consequence,
values of the parameter $b$ smaller than the ones inferred from (20) suffice in order
for the before-running lower bound on the fraction $f$ to be close to its asymptotic
value $(1 + \epsilon)^{-1}$. This is quantified in the following theorem.



Theorem 3. The i-margitron with < e < 1 and b (at least as large as the 

one) given by 



(35) 



(5 > 0) converges in a finite number of updates to a solution hyperplane pos- 
sessing directional margin 7^ which is a fraction f of the maximum directional 
margin jd obeying the inequality 



J _ Ik fx J- ^ J- ^A-i 

Proof. Notice that 



/=Ii>(5 + l + e)-i . (36) 



b ye (l+e)^-t-l 



7^ ~ 2e<5 7: 



is a lower bound on ii, of (33) since it is lower than the lower bound inferred from 
Lemma 1. Replacing with its lower bound fib2 in the r.h.s. of (34) completes 
the proof. (Larger 6's may be regarded as corresponding to smaller (5's.) q 

Remark 4. For $\epsilon \ll 1$ a more accurate determination of $b$ ensuring that (36)
holds is obtained from ■^t7= (^f-) ' with w = ^(1 - e)(l + e-^)(2 + e)(l + 

e)^ln|^JeT^(l-e)(l + e-i)(2 + e)(l + e)^^^ and < (5 < e-i(l + 

e~^)(2 + e)^T- For such a h and taking into account the constraint on 6 it 

'd 

1 

_ / /i \ *- — f?^ 

can be verified that tib = I (1 + e)— rrr ) = (1 + e) satisfies the inequal- 



'7r7 ' ' ^ 7^ 

ity iib > e. Moreover, any possible root of g{t) defined in (32) and the single 
root of gi{t) arc necessarily larger than t\\^. Therefore, since fib > e and given 

d Int 



that > y-^^— ^ ^) for t > e there is a single root ib of g{t) satisfying 

th > ^bi > iib > e- Combining (30) for iV = 1 with the last inequality and the 
relation g{t]j) = wc get 

Let X = 4ei^(l - e)(l + e-M(2 + e)(l + e)^^. Then, = e"i^a;lna; 

Yd Id 

and 

(2 + e)^[^) '"' = (2 + e)(l + ef-^u;-' (1 ln(l + e) + In (o;^)) 

< (2 + e)(l + e)^.- (l + (1 - e) In (-f )) = (J^^^^ < S (37) 

(In Inx/lnx < e^^). Thus, our choice of b ensures that / > {6+1 + e)~^. Substi- 
tuting b into (19) we conclude that in the £-margitron as e, 5 ^ the upper bound 
on the number of updates ~ {e'^ + S'^ In d-^)\ii{e-^ + 6''^ In 6-^)R^ /-fl- For 
e ^ with S fixed, instead, the bound --^ lne~^i?^/7^. For 5 <C 1 and 
i5/e < A « 1, however, a more accurate upper bound on tc may be obtained by 
observing that 

= ,(fb) >tl-i2 + e)f (if.) - 4 >il-i2 + e)^{^) - 4 
from where (using also (37)) 

<4(l + Y^-^)' <*ib(l + '5)^ <iibee =e7(l + e)7w^ . 



Taking into account that tc < ibi < ^b we conclude that as (5 ^ with S/e 
bounded from above (e.g. 5 = e — > 0) the upper bound on tc ~ ln6~^R'^ /^\. 

Theorem 4. There is a value of the parameter $b$ for which the $\ell$-margitron with
$\epsilon \ll 1$ converges to a solution hyperplane with directional margin $\gamma'_d > (1 - 2\epsilon)\gamma_d$
in less than ~ e~^lne~^R^/^^ updates. 

Proof. Set $\delta = \epsilon$ in Remark 4 and notice that $f \ge (1 + 2\epsilon)^{-1} > 1 - 2\epsilon$. □

Theorem 5. Both the t- and the $\ell$-margitron with $1 < \epsilon < 2$ converge in $t_c$
updates, with $t_c$ bounded from above by

$$\frac{R^2}{\gamma_d^2} + \left( \frac{2}{2 - \epsilon} \frac{b}{\gamma_d^2} \right)^{1/\epsilon} \quad \text{and} \quad \frac{R^2}{\gamma_d^2} + \left( \frac{2}{2 - \epsilon} \frac{b}{\gamma_d^{1+\epsilon}} \right)^{1/\epsilon} ,$$

respectively, to a solution hyperplane possessing directional margin $\gamma'_d$ which in
the limit $b \to \infty$ satisfies the inequality $\gamma'_d \ge (1 - \epsilon/2)\gamma_d$.

Proof. For the t-margitron the analysis of Theorem 1 that led to (13) remains
valid and the single root $t_b$ of $g(t)$ still provides an upper bound on $t_c$. The
bound on $t_c$ stated in Theorem 5 is the upper bound on $t_b$ inferred from Lemma
1. The analysis that led to (9) remains also valid but we are no longer allowed
to replace $t_c$ with its lower bound $t_c = 1$. Instead, we may replace $t_c$ in (9) with
its upper bound stated in Theorem 5. Then, as $b \to \infty$ we get $\gamma'_d \ge (1 - \epsilon/2)\gamma_d$.

In the case of the $\ell$-margitron $a_t \cdot y_k$ for a misclassified pattern $y_k$ may
be bounded from above by employing (4) and (6) as $a_t \cdot y_k \le b \|a_t\|^{1-\epsilon} \le
b (\gamma_d t)^{1-\epsilon}$. Then, the analysis of Theorem 1 that led to (13) remains valid with
the replacement of $b$ by $b \gamma_d^{1-\epsilon}$. The bound on $t_c$ stated in Theorem 5 is the upper
bound on $t_b$ inferred from Lemma 1. For the fraction $f$, instead, employing (4),
(5), (11) and (13), with the last two relations taken at $t = t_b$ as equalities, we
have

1 

p2 / 2 b 

Replacing th with its upper bound + ( 211^ ^1 + ^ ) ™ above relation 



leads to a weaker bound from where we get $\lim_{b \to \infty} f \ge 1 - \epsilon/2$. □



4 Experiments 

To reduce the computational cost we follow [17] and form a reduced "active set" 
of patterns consisting of the ones found misclassified during each epoch which 
are then cyclically presented to the Margitron algorithm for $N_{ep}$ mini-epochs
unless no update occurs during a mini-epoch. Subsequently, a new full epoch 
involving all the patterns takes place giving rise to a new active set. The algo- 
rithm terminates only if no mistake occurs during a full epoch. This procedure 
clearly amounts to a different way of sequentially presenting the patterns to the 
algorithm and does not affect the applicability of our theoretical analysis. 
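
The presentation scheme described above can be sketched as a small driver that is independent of the particular misclassification condition. This is an illustrative reading, not the authors' code: the names `run_with_active_set` and `make_perceptron_with_margin` are hypothetical, updates are assumed to take place during the full epochs as well, and the callbacks shown implement only the simplest case (constant threshold, i.e. the Perceptron with margin).

```python
def run_with_active_set(patterns, is_misclassified, update, n_mini_epochs):
    """Alternate full epochs with mini-epochs over the patterns found
    misclassified in the preceding full epoch. Returns the number of full
    epochs performed. Terminates only if a full epoch is mistake-free."""
    full_epochs = 0
    while True:
        full_epochs += 1
        active = []
        for y in patterns:                    # full epoch over all patterns
            if is_misclassified(y):
                update(y)
                active.append(y)
        if not active:                        # mistake-free full epoch: stop
            return full_epochs
        for _ in range(n_mini_epochs):        # mini-epochs on the active set
            changed = False
            for y in active:
                if is_misclassified(y):
                    update(y)
                    changed = True
            if not changed:                   # no update in this mini-epoch
                break


def make_perceptron_with_margin(dim, b):
    """Hypothetical helper: weight vector and callbacks for a constant
    threshold b (threshold 0 before the first update)."""
    a = [0.0] * dim
    counter = {"updates": 0}

    def is_misclassified(y):
        threshold = b if counter["updates"] > 0 else 0.0
        return sum(ai * yi for ai, yi in zip(a, y)) <= threshold

    def update(y):
        for i, yi in enumerate(y):
            a[i] += yi
        counter["updates"] += 1

    return a, counter, is_misclassified, update


# Hypothetical toy set, already augmented and reflected.
P = [[2.0, 1.0, 1.0], [1.0, 2.0, -1.0], [3.0, 1.0, 1.0], [1.0, 3.0, -1.0]]
a, counter, is_misclassified, update = make_perceptron_with_margin(3, b=5.0)
epochs = run_with_active_set(P, is_misclassified, update, n_mini_epochs=50)
```

Because the driver only changes the order in which patterns are presented, any misclassification condition and update pair can be plugged in through the two callbacks.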

Table 1. Results of a comparative study of SVM^l, t-margitron and ℓ-margitron.

                SVM^l (ε = 0.01)                       t-margitron                             ℓ-margitron
data set  Δ     10^[?]γ'  Secs      ρ            N_ep  ε       [illegible]  10^[?]γ'  Secs     ε       [illegible]  10^[?]γ'  Secs
Adult     1     8.4899    1810.3    [illegible]  50    0.001   491.1        8.4917    72.2     0.0005  220.4        8.4903    68.2
Web       1     20.941    250.0     0.1          10    0.2     8400         20.944    17.6     0.2     1250         20.942    17.4
C11       0.1   1.7818    6172.0    0.1          50    0.2     10000        1.7822    631.7    0.2     1600         1.7821    655.3
CCAT      0.1   0.9016    48235.0   0.1          50    0.1     514.2        0.9016    2324.8   0.1     285          0.9018    2369.7
Cover     10    15.774    47987.7   1            20    0.01    158.6        15.774    1866.1   0.005   121.7        15.776    1760.0

We compare the t- and the $\ell$-margitron with SVMs on the basis of their
ability to achieve fast convergence to a certain approximation of the "optimal"
hyperplane in the feature space where the patterns are linearly separable. For
linearly separable data the feature space is the initial instance space whereas for
linearly inseparable data (which is the case here) a space extended by as many
dimensions as the instances is considered where each instance is placed at a
distance $\Delta$ from the origin in the corresponding dimension. The extension generates
a margin of at least $\Delta / \sqrt{n}$ with $n$ being the number of patterns and amounts to
adding a term $\Delta^2$ to the diagonal entries of the kernel (linear in our case).
Moreover, its employment is justified by the well-known equivalence between the hard
margin optimisation in the extended space and the soft margin optimisation in
the initial instance space with objective function $\|w\|^2 + \Delta^{-2} \sum_i \xi_i^2$ involving the
weight vector $w$ and the 2-norm of the slacks $\xi_i$ [1]. We emphasize that SVMs
and the Margitron are required to solve identical hard margin problems.
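
The effect of the extended space on the kernel, and the margin it guarantees, can be checked on a toy example. The sketch below is an illustration under stated assumptions: `extend_patterns` and `gram_matrix` are hypothetical helper names, the instances and labels are made up, and the extension parameter is written `delta`.

```python
import math


def extend_patterns(patterns, delta):
    """Give each of the n instances its own extra coordinate equal to delta."""
    n = len(patterns)
    return [
        list(x) + [delta if j == i else 0.0 for j in range(n)]
        for i, x in enumerate(patterns)
    ]


def gram_matrix(patterns):
    """Linear kernel: matrix of pairwise inner products."""
    return [[sum(p * q for p, q in zip(x, z)) for z in patterns] for x in patterns]


X = [[1.0, 2.0], [3.0, -1.0], [0.0, 1.0]]    # hypothetical instances
labels = [1, -1, 1]                           # hypothetical labels
delta = 0.5
n = len(X)

K = gram_matrix(X)
K_ext = gram_matrix(extend_patterns(X, delta))

# A unit vector living only in the extra dimensions: component i is label_i / sqrt(n).
u = [0.0] * len(X[0]) + [label / math.sqrt(n) for label in labels]
margins = [
    label * sum(ui * xi for ui, xi in zip(u, x))
    for label, x in zip(labels, extend_patterns(X, delta))
]
```

The extended Gram matrix differs from the original one only by `delta ** 2` on the diagonal, and the vector `u` separates every instance with margin `delta / sqrt(n)` regardless of the original data, which is the lower bound quoted in the text.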

In our experiments SVMs are represented by SVM^light [5], denoted here as
SVM^l, a decomposition method algorithm which is many orders of magnitude
faster than standard SVMs. For SVM^l we choose a memory parameter m =
400MB and a 1-norm soft margin parameter C = 10^[?] (approximating C = ∞)
since we are dealing with a hard margin problem in the appropriate feature
space. The choice of the accuracy $\epsilon$ depends on the case. For the remaining
parameters default values are used. The experiments were conducted on a 1.8 
GHz Intel Pentium M processor with 504 MB RAM running Windows XP. The 
codes written in C++ were run using Microsoft's Visual C++ 5.0 compiler. 

The datasets we used for training are the Adult (32561 instances, 123 binary
attributes) and Web (49749 instances, 300 binary attributes) UCI datasets
as compiled by Platt (see [5]), the test0 set from the Reuters RCV1 collection
(199328 instances, 47236 attributes with average sparsity 0.16%) obtainable from
http://www.jmlr.org/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm
and the multiclass Covertype (Cover) UCI dataset (581012 instances, 54
attributes). In the case of the RCV1 we considered both the C11 and the CCAT
binary text classification tasks while in the case of the Covertype dataset we
studied the binary classification problem of the first class versus all the others.
The Covertype dataset was rescaled by multiplying all the attributes with 0.001.

In Table 1 we present the results (i.e. geometric margin $\gamma'$ achieved and CPU
secs needed) of our first comparative study involving the algorithms SVM^l,
t-margitron and $\ell$-margitron together with the values of the parameters employed.
A solution hyperplane in the extended space was first obtained using SVM^l and
subsequently the Margitron was required to obtain a solution of comparable
geometric margin. The extended space parameter $\Delta$ refers to both SVM^l and
the Margitron while the augmented space parameter $\rho$ and the number of
mini-epochs $N_{ep}$ only to the Margitron. Also, for the Margitron $\gamma'$ is the geometric
margin in the original (non-augmented) feature space with the augmentation
providing for the bias. We see from the CPU times of Table 1 that the Margitron is
roughly 10 to 27 times faster than SVM^l on these rather large datasets. It is
understood, of course, that some additional computer time was spent to locate the
appropriate value of $b$.

Recently SVM^perf [6], a cutting-plane algorithm for training linear SVMs,
was presented. We did make an attempt at including SVM^perf in our comparative
study but we found that it requires a much longer CPU time to converge
compared to SVM^l without even achieving as large values of the margin $\gamma'$.
Table 2 contains our experimental results on the datasets Adult and Web ($\Delta = 1$).
Apparently, the "accuracy" $\epsilon$ of SVM^perf is not directly related to the fraction
of the maximum margin achieved.



Table 2. Results of experiments with SVM^perf.

data set  ε                     C                10^[?]γ'  Secs
Adult     3 × 10^[illegible]    10^[illegible]   5.9130    51150.[illegible]
Web       2 × 10^[illegible]    10^[illegible]   20.891    7297.9



In Table 3 we present the directional margin $\gamma'_d$ achieved by the t- and the
$\ell$-margitron together with the after-running estimate $f_{est}$ of the ratio $\gamma'_d / \gamma_d$ and
its asymptotic value for comparison. Let us accept that the geometric margin $\gamma'$
reported in Table 1 is larger than 99% of the maximum geometric margin $\gamma$ as
the accuracy $\epsilon = 0.01$ of SVM^l suggests. Then, taking into account that $\gamma \ge \gamma_d$
and that $(\gamma' - \gamma'_d) / \gamma' < 0.02$ we see that $\gamma'_d / \gamma_d > 0.97$. Thus, we may conclude
that the estimates of Table 3 are certainly impressive given that they come from
worst-case bounds which are not expected to be very tight and that they cannot,
of course, exceed their asymptotic values.
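
The figure of 0.97 quoted above follows from a two-line chain of inequalities. The worked step below is a sketch that assumes, as in the argument of the text, that the geometric margin $\gamma'$ of the Margitron solution exceeds 99% of $\gamma$ (the Margitron values of $\gamma'$ in Table 1 are not smaller than the SVM ones) and that $\gamma'_d > 0.98\, \gamma'$.

```latex
\begin{align*}
\frac{\gamma'_d}{\gamma_d}
  \;\ge\; \frac{\gamma'_d}{\gamma}
  \;=\; \frac{\gamma'_d}{\gamma'} \cdot \frac{\gamma'}{\gamma}
  \;>\; 0.98 \times 0.99 \;=\; 0.9702 \;>\; 0.97 ,
\end{align*}
% first step: \gamma \ge \gamma_d ; second factor: \gamma' > 0.99\,\gamma .
```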



Table 3. The directional margin γ'_d achieved by the t- and the ℓ-margitron together
with the after-running estimate f_est of the ratio γ'_d/γ_d and its asymptotic value.

          t-margitron                       ℓ-margitron
data set  10^[?]γ'_d  f_est   1 − ε/2       10^[?]γ'_d  f_est   (1 + ε)^{-1}
Adult     8.4917      0.9898  0.9995        8.4903      0.9420  0.9995
Web       20.574      0.8645  0.9000        20.573      0.7561  0.8333
C11       1.7789      0.8923  0.9000        1.7787      0.8156  0.8333
CCAT      0.9016      0.9404  0.9500        0.9018      0.8752  0.9091
Cover     15.714      0.9873  0.9950        15.716      0.9703  0.9950

Table 4. A comparison between the ℓ-margitron (successive runnings) and SVM^l.

          ℓ-margitron, ε = 1                      ℓ-margitron, ε = 0.1                                       SVM^l
data set  [illegible]  10^[?]γ_d^up  Secs         10^[?]b      [illegible]  f_est  [illegible]  Secs         ε       10^[?]γ'  Secs
Adult     6.8839       11.352        3.2          577          8.3274       0.838  8.3274       39.5         0.055   8.3257    1178.9
Web       19.202       29.840        5.0          551          20.677       0.864  21.053       44.0         0.0031  21.051    291.2
C11       1.5435       2.5607        121.7        506          1.7765       0.863  1.7798       774.7        0.012   1.7798    5952.0
CCAT      0.7800       1.2701        366.7        270          0.8989       0.855  0.8989       1771.5       0.0165  0.8984    42548.8
Cover     10.011       19.500        [illegible]  [illegible]  11.671       0.810  11.735       527.1        0.085   11.718    29102.8



From (16) and (35) it becomes apparent that the minimal value of $b$
guaranteeing the desired accuracy depends on the maximum directional margin $\gamma_d$.
Moreover, this dependence becomes increasingly crucial with decreasing $\epsilon$. This
last observation prompts us to proceed to a determination of the large margin
solution in successive runnings starting with the Margitron with $\epsilon = 1$, which is
the more insensitive to the value of $\gamma_d$, and gradually moving towards employing
algorithms with smaller $\epsilon$'s able to guarantee larger fractions of $\gamma_d$. Each running
in this process will provide us with an interval in which the value of $\gamma_d$ lies which,
hopefully, will shrink as we move towards smaller $\epsilon$'s. This information will then
allow us to fix the value of $b$ to be used in the next running. The lower bound on
$\gamma_d$ will be the margin $\gamma'_d$ achieved. The upper bound $\gamma_d^{up}$ will be provided by
exploiting the after-running estimate $f_{est}$ of $\gamma'_d / \gamma_d$ which gives $\gamma_d^{up} = \gamma'_d / f_{est}$.
Alternatively, we may employ the upper bound on the number of updates $t_c$
required for convergence to obtain a value for $\gamma_d^{up}$. For $\epsilon = 1$ this gives
$\gamma_d^{up} = R \sqrt{(1 + 2b/R^2)\, t_c^{-1}}$ which is usually lower than the upper bound
$(R^2/b + 2)\, \gamma'_d$ on $\gamma_d$ obtained from $\gamma'_d / \gamma_d \ge (R^2/b + 2)^{-1}$. This procedure
may be followed using either the t- or the $\ell$-margitron but in the former case we
may encounter difficulties for $\epsilon < 1/2$ due to the lack of the strong before-running
guarantees stemming from (15).
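
The bracketing of $\gamma_d$ used between successive runnings amounts to a few arithmetic steps, sketched below. The function names are hypothetical and the numbers are made up for illustration; the $\epsilon = 1$ formula is the reading $\gamma_d \le R \sqrt{(1 + 2b/R^2)/t_c}$, which reduces to $R\sqrt{11/t_c}$ for $b/R^2 = 5$.

```python
import math


def gamma_d_interval(gamma_d_achieved, f_est):
    """Bracket gamma_d between the achieved directional margin and the
    value implied by the after-running estimate f_est."""
    return gamma_d_achieved, gamma_d_achieved / f_est


def gamma_d_upper_from_updates(R, b, t_c):
    """eps = 1 only: upper bound on gamma_d from the number of updates t_c."""
    return R * math.sqrt((1.0 + 2.0 * b / (R * R)) / t_c)


def gamma_d_upper_from_fraction(R, b, gamma_d_achieved):
    """eps = 1 only: looser upper bound (R^2/b + 2) * achieved margin."""
    return (R * R / b + 2.0) * gamma_d_achieved


# Hypothetical first-stage run with b / R^2 = 5.
R, b, t_c = 2.0, 20.0, 11
upper_updates = gamma_d_upper_from_updates(R, b, t_c)
lower, upper_est = gamma_d_interval(gamma_d_achieved=8.0, f_est=0.8)
```

Whichever upper bound turns out smaller can be combined with the achieved margin to fix $b$ for the next running.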

In Table 4 we present the results of a second comparative study between the
$\ell$-margitron and SVM^l. For the $\ell$-margitron we followed the procedure of successive
runnings that we just described involving only two stages with $\epsilon$ values 1 and 0.1.
The extended and augmented feature spaces were identical to the ones of Table 1
and a common value $N_{ep} = 50$ was chosen for all datasets. Also, in the first stage
($\epsilon = 1$) we made the common choice $b/R^2 = 5$ and obtained $\gamma_d^{up}$ from the relation

$\gamma_d^{up} = R \sqrt{11\, t_c^{-1}}$. Then, in the second stage ($\epsilon = 0.1$) we fixed $b$ from (35) with

l-e 

S = (7d/7d^) ^ < 1 (which eliminates the dependence of 6 on 7d) employing 
the $\gamma_d^{up}$ obtained in the first stage. This way we shift the uncertainty in $\gamma_d / \gamma_d^{up}$
to the before-running accuracy $\delta$ and rely on the after-running lower bound $f_{est}$
on $\gamma'_d / \gamma_d$ to assess the accuracy actually achieved. We see that $f_{est}$ is well above
0.8 for all datasets. A comparison with SVM^l on solutions of comparable margin
reveals that the $\ell$-margitron remains considerably faster even if the time spent
to fix $b$ is taken into account.

5 Conclusions

We generalised the classical Perceptron algorithm with margin by constructing
the Margitron, a family of incremental large margin classifiers all the members
of which employ the original perceptron update. The Margitron consists of two
classes, namely the t-margitron with algorithms involving explicitly the number
of updates and the $\ell$-margitron the members of which depend only on the length
of the weight vector and as such lie closer in spirit to the Perceptron. We proved
that as the parameter $\epsilon$ decreases from 2 to 0 the corresponding algorithms in
both classes converge in a finite number of updates to hyperplanes possessing a
guaranteed fraction of the maximum margin the largest possible value of which
varies continuously in the interval (0, 1). The Perceptron with margin belongs
to both classes and is associated with the middle point of the above intervals.
Finally, our experimental comparative study between algorithms from the
margitron family and SVM^light on tasks involving linear kernels and 2-norm soft
margin revealed that the Margitron is a serious alternative to linear SVMs.